Abstract: Concurrent programming is notoriously difficult, especially in constrained embedded
contexts. Threads, in particular, are wildly nondeterministic as a model of computation, and 
difficult to analyze in the general case. Fortunately, it is often the case that multi-threaded, 
semaphore-synchronized embedded software implements high-level functional specifications written 
in a deterministic data-flow language such as Scade or (safe subsets of) Simulink. We claim that 
in this case the implementation process should build not just the multi-threaded C code, but (first 
and foremost) a richer model exposing the dataflow organization of the computations performed by 
the implementation. From this model, the C code is extracted through selective pretty-printing, 
while knowledge of the data-flow organization facilitates analysis. We propose a language for 
describing such implementation models that expose the data-flow behavior (the sheep) hiding 
under the form of a multi-threaded program (the wolf). The language allows the representation 
of efficient implementations featuring pipelined scheduling and optimized memory allocation and 
synchronization. We show applicability on a large-scale industrial avionics case study and on a 
commercial many-core. 

Key-words: synchronous languages, Kahn process networks, execution platform, semantic
preservation, implementation model, multi-thread, parallelism, Lustre, Scade
Sheep disguised as wolves: Implementation models for data-flow parallel systems




Implementation models for data-flow multi-threaded software

Contents

1 Introduction
    1.1 Motivating example
    1.2 Related work and originality
2 Modifications to Lustre
    2.1 Synchronous semantics principles
    2.2 Kahnian asynchronous interpretation
        2.2.1 Preliminary notations
        2.2.2 Program state
        2.2.3 Semantic rules and semantics preservation
        2.2.4 Properties ensuring implementability
3 Implementation modeling
    3.1 Target architectures
    3.2 Mapping information
    3.3 Platform semantics
        3.3.1 State representation
        3.3.2 Semantic rules
        3.3.3 Correctness and semantics preservation
4 Experimental evaluation
5 Conclusion
A Synchronous semantics of InteLus
    A.1 State representation
    A.2 Notations
    A.3 Structural operational semantics rules
    A.4 Clocks
    A.5 Equation guards

RR n° 9057


Didier et al. 


1 Introduction 

Mastering concurrency is difficult, and yet hardware design resolutely moves towards increasingly 
massive parallelism based on the use of chip-multiprocessors. Threads [8] are one of the major
programming paradigms for such multi- and many-core systems. They arguably provide the best 
portability and the finest control over resource allocation, which are both essential in the design 
of embedded applications that need to get the best guaranteed performance out of resource- 
constrained hardware. 

But such expressiveness comes at a price. As a model of computation, threads are wildly
non-deterministic and non-compositional [8], making both programming and formal analysis
difficult. This explains why multi-threaded software is often bug-ridden even in the context of
critical systems [?].

But there is also good news: in many industrial contexts (avionics, automotive, etc.) the use
of threads is tightly controlled. We consider in this paper the particular case where the functional
specification of the system is written in a synchronous dataflow language such as Simulink^1 or
Scade^2. In this case, multi-threaded implementations have a particular structure and particular
properties: the number of threads is fixed, each one implementing a recurrent task obtained by
sequencing a fixed set of dataflow blocks (or parts thereof, obtained by parallelization).

Such hypotheses, when taken individually, largely facilitate the formal analysis of multi-threaded
systems. But in many cases, the multi-threaded implementation preserves a fundamentally
dataflow structure, with specific rules on the way platform resources (shared memory,
semaphores) are used. When this happens, the implementation is best represented as a dataflow
synchronous program whose elements are mapped on the platform resources. Ensuring the
correctness of such an implementation consists in ensuring that:

1. The dataflow program (without the mapping) implements the semantics of the functional 
specification. This analysis can be performed inside the dataflow model. 

2. Once the mapping of program elements onto the platform resources^3 is performed, the
execution of the platform (under platform semantics) implements the behavior of the dataflow
program.

Together, the dataflow program and the mapping information form an implementation model. 
This model is strictly richer than the multi-threaded C code, which can be obtained through a 
pretty-printing of model parts. Exposing the internal data-flow structure of the implementation 
facilitates defining and establishing correctness, e.g. the correctness of the synchronization or 
memory coherence protocols synthesized during the implementation process. All analyses can be 
realized using efficient tools specific to the synchronous model. Finally, if manual inspection of 
the C multi-threaded code is required, such a representation can be used to enforce strict code 
structuring rules which facilitate understanding. 

1.1 Motivating example

Mapping being necessarily platform-dependent, we shall consider in this paper shared memory
multi-core platforms with a specific memory organization, detailed in Section 3.1. The example
in Fig. 1 provides a simple dataflow program, a very simple C implementation with two threads,
and the corresponding implementation model (panel (b)).

1 https://www.mathworks.com/products/simulink.html

2 http://www.esterel-technologies.com/products/scade-suite/ 

3 Sequencing of blocks into threads executed by processors; code, stack and data variables to memory locations; 
synchronizations to semaphores, etc. 
(a) Specification

    fun Prod:()->(int)
    fun Cons:(int)->()
    var x:int
    let
      (x) = Prod()
      () = Cons(x)
    tel

(b) Implementation model

     1  fun Prod:()->(int) at 0x20100
     2  fun Cons:(int)->() at 0x30200
     3  var x:int at 0x22000 on cpu0
     4      x_cpu1:int at 0x22000 on cpu1
     5      x_ram:int at 0x22000
     6      u:event u1:event v:event v1:event
     7      p:event q:event r:event s:event
     8  let
     9  thread on cpu0 at 0x20000 stack 0x30000
    10
    11    top done(q) [wait:sem_1]      (_) = (v1)
    12    top wait(q)                   (x) = Prod()
    13    top done(p) [flush:0x22000]   (x_ram) = (x)
    14    top wait(p) [signal:sem_0]    (u) = top
    15
    16
    17  thread on cpu1 at 0x30000 stack 0x40000
    18
    19    top done(r) [wait:sem_0]      (_) = (u1)
    20    top wait(r) [inval:0x22000]   (x_cpu1) = (x_ram)
    21    top done(s)                   () = Cons(x_cpu1)
    22    top wait(s) [signal:sem_1]    (v) = top
    23
    24  semaphore sem_0 (u) (u1) init false on lock_0
    25    (u1) = (u)
    26  semaphore sem_1 (v) (v1) init true on lock_1
    27    (v1) = top fby (v)
    28  tel

(c) Two-thread C implementation and linker script fragment

     1
     2
     3  extern int x ;
     4  extern int x_cpu1 ;
     5  __attribute__((section(".oncpu0")))
     6  void init() { signal(lock_1) ; }
     7
     8  __attribute__((section(".oncpu0")))
     9  void thread_cpu0() {
    10    init() ; global_barrier() ; for(;;) {
    11      wait(lock_1) ;
    12      Prod(&x) ;
    13      flush(&x) ;
    14      signal(lock_0) ;
    15  }}
    16  __attribute__((section(".oncpu1")))
    17  void thread_cpu1() {
    18    global_barrier() ; for(;;) {
    19      wait(lock_0) ;
    20      inval(&x_cpu1) ;
    21      Cons(&x_cpu1) ;
    22      signal(lock_1) ;
    23  }}
    24  ldscript fragment:
    25  x=0x22000; x_cpu1=0x22000;
    26  stack0=0x30000; stack1=0x40000;
    27  .=0x20000; .bank2:{*(.oncpu0);
    28    . = 0x100 ; *(.Prod) ;}
    29  .=0x30000; .bank3:{*(.oncpu1);
    30    . = 0x200 ; *(.Cons) ;}

Figure 1: Simple producer-consumer specification (a), two-thread implementation (c), and the
corresponding implementation model (b) providing both the data-flow model of the implementation
(in black) and the mapping information allowing execution on the HW platform (in red).
Line numbering is common between (b) and (c).

The specification program follows a simplified Lustre [4] syntax, presented in Section 2, and
defines a simple producer-consumer application with a single communication variable x. The C
implementation is much more complex. While the production and consumption function calls
can still be identified, they are surrounded by calls whose function is to ensure correct execution
on the asynchronous multi-core platform:

• Semaphore wait and signal API calls ensure that production happens before consumption,
and that consumption is completed when a communication variable is reused for production.

• Data cache dflush and dinval API calls implement the memory coherency protocol
ensuring that the consumer uses the correct data.


Note that the multi-threaded implementation consists not only of C code, but also comprises 
GCC annotations and the linker script which allow the definition of the memory mapping. 

Such tightly-controlled mapping is often employed in critical embedded systems. In avionics
applications like our case study, the worst case execution time must be demonstrated for normal
conditions, but the application must also be robust to “external factors”. The choice of a
semaphore-synchronized implementation helps with the second objective, largely improving
robustness by essentially guaranteeing the respect of the functional semantics. Achieving the first
objective can then be done through tight control of memory allocation and synchronization, and
through the use of hardware with good support for timing predictability (hardware semaphore
implementation, no shared caches, etc.). These design choices, not covered in this paper, reduce
timing variability and facilitate timing analysis.

As explained above, the implementation model of Fig. 1(b) consists in a dataflow program (in
black) extended with annotations defining all the aspects of its mapping (in red). The dataflow
program uses some extensions to Lustre allowing the description of synchronization. These
extensions, defined in Section 2, include the synchronization data type event and the wait and
done constructs that allow the definition of sequencing constraints not implied by data variables.
The dataflow implementation program, endowed with Kahnian asynchronous semantics (defined
in Section 2.2), provides a precise functional model of the execution on the platform. For instance,
specification variable x is replaced here with 3 variables x, x_cpu1, and x_ram allowing the
representation of the various states of the memory system where the value produced on processor cpu0
has not yet been propagated to the RAM or to the cache of processor cpu1. The implementation
model also provides dataflow interpretations for the various API calls. For instance, in line 13
the dataflow interpretation of dflush ensures that the local value of x has been propagated onto
its RAM counterpart x_ram, and in line 14 the dataflow interpretation of signal produces a
token (the special literal top) that can be consumed later by a wait call. The equations in lines
25 and 27 are not part of threads. They provide the semantics of platform semaphores.
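
The token reading of the two semaphores can be illustrated with a small sequential model. The sketch below is not the platform code of Fig. 1(c): it assumes a hypothetical Prod that yields 1, 2, 3, ..., replaces the two threads by step functions driven by an arbitrary scheduler, and ignores the cache flush and invalidate operations. All names (model_t, producer_step, consumer_step, run) are introduced for illustration only. Each semaphore is a counter of pending top tokens, with sem_1 starting with one token, as suggested by the fby equation of line 27.

```c
#include <assert.h>
#include <stdbool.h>

/* Sequential model of the producer-consumer protocol of Fig. 1(b).
 * Semaphores are counters of pending `top` tokens (an assumption of
 * this sketch, not the hardware lock implementation). */
typedef struct {
    int sem0_tokens;  /* (u1) = (u): starts with no token          */
    int sem1_tokens;  /* (v1) = top fby (v): starts with one token */
    int x;            /* the single communication variable         */
    int produced;     /* number of Prod firings so far             */
    int consumed;     /* number of Cons firings so far             */
    bool in_order;    /* false if Cons ever reads an unexpected x  */
} model_t;

/* One iteration of the producer thread body; false if it would block. */
static bool producer_step(model_t *m) {
    if (m->sem1_tokens == 0) return false;  /* wait(lock_1) would block */
    m->sem1_tokens--;                       /* consume the token on v1  */
    m->x = ++m->produced;                   /* hypothetical Prod: 1, 2, 3, ... */
    m->sem0_tokens++;                       /* signal(lock_0): (u) = top */
    return true;
}

/* One iteration of the consumer thread body; false if it would block. */
static bool consumer_step(model_t *m) {
    if (m->sem0_tokens == 0) return false;  /* wait(lock_0) would block */
    m->sem0_tokens--;                       /* consume the token on u1  */
    m->consumed++;
    if (m->x != m->consumed) m->in_order = false;  /* Cons checks its input */
    m->sem1_tokens++;                       /* signal(lock_1): (v) = top */
    return true;
}

/* Performs `steps` scheduling decisions. `consumer_first` selects which
 * thread the scheduler tries first; the other one runs if it is blocked. */
static model_t run(int steps, bool consumer_first) {
    assert(steps >= 0);
    model_t m = { 0, 1, 0, 0, 0, true };
    for (int i = 0; i < steps; i++) {
        if (consumer_first) {
            if (!consumer_step(&m)) producer_step(&m);
        } else {
            if (!producer_step(&m)) consumer_step(&m);
        }
    }
    return m;
}
```

Whichever thread the scheduler favors, the token counters force production and consumption to alternate, so the modeled consumer sees each produced value exactly once and in order. This is the kind of schedule-independent behavior the dataflow interpretation is meant to expose.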

Mapping information defines the construction of sequential threads and their allocation to
processors, the allocation of all code, stack and data to memory, and the implementation of
interprocessor synchronization with semaphores that are allocated onto hardware locks. Note
that the C code is obtained through a pretty printing of implementation model elements. For
instance, there exists a line-to-line correspondence between equations of a thread in the
implementation model and function calls in the infinite loop of the corresponding C thread. Also note
that an explicit resource allocation is needed to allow the representation of efficient
implementations. The initial source code could be given a translation into multi-threaded code, but such a
translation would ignore (optimized) resource allocation issues.


Outline. The remainder of the paper will detail the elements of our implementation language, 
and show that it can be used to represent efficient implementations of large-scale applications. 
Section 1.2 presents related work. Section 2 presents the modifications we bring to the
Lustre/Scade language, introduces its less commonly known Kahnian asynchronous semantics, and
presents an efficient implementation of our simple example. Section 3 defines the syntax and
(platform) semantics of the implementation modeling language. Section 4 presents our
experimental results, and Section 5 concludes.

1.2 Related work and originality 

Most work on parallel application mapping (e.g. [?]) involves a code generation phase that
escapes formal analysis, covering at least some of the aspects we consider here: construction of
threads, synthesis of synchronization and memory consistency protocols, memory allocation, etc.
From this perspective, our work is an attempt at uniform formal modeling of this phase, which
dissociates platform-independent aspects from platform-dependent ones, and which currently
covers shared memory, semaphore-synchronized platforms.

Our formalization effort can also be seen as the first step towards formal proof of correctness
for this code generation phase. This is similar to previous work on formally proved compilation
[?], on static analysis of parallel software [?], and on rigorous systems design using BIP [3]. The
main difference is that we remain solidly attached to a dataflow, deterministic semantic model
even in the description of implementations. We expect this choice will simplify the specification
and proof of correctness results.

Previous work on the compilation and the translation validation for the Signal language
[?] also maintains an end-to-end dataflow formalization. By comparison, our approach
extends dataflow modeling to cover low-level multi-processor implementation issues (semaphore
synchronization, memory consistency, construction of threads).

The objective of reducing the semantic distance between specification and implementation
is also covered in [?]. The difference is the dataflow model with simpler control structure that
we consider, and the fact that we consider aspects such as synchronization, memory consistency,
and memory allocation.

From a more classical modeling point of view our work does not aim for the generality of
UML/MARTE [12], but rather to provide a specific solution to the problem of correct
multi-thread implementation of dataflow synchronous specifications. In this, it joins previous work
that enriches dataflow languages with annotations describing non-functional requirements [2].


2 Modifications to Lustre 

This section introduces a dataflow synchronous language for system-level functional specification
and for defining the dataflow part of implementation models. We could define this language as a
strict extension of Lustre [4] or Scade with new constructs for our new modeling needs. However,
the system-level and implementation-oriented perspective makes some major features of these
languages unneeded, and including them would only clutter our presentation. For this reason,
we remove them. We shall not dwell here on the syntax and semantics of Lustre, which have been
covered elsewhere. Instead, we focus on the various modifications.

We call the new language InteLus, for Integration Lustre, and its syntax is provided in the
black and blue parts of Fig. 2. Unlike Lustre and Scade, which also allow the programming of
the sequential tasks of an embedded system, InteLus is only designed to allow the system-level
integration of these tasks. For this reason, InteLus does not have a module system. One aspect
of this is that a full-fledged interface definition is not needed at system level. We only identify
input variables, which intuitively correspond to memory-mapped input devices.
    <prog>      ::= <type>* <fun>* var <var>* let <ceq>* <thread>+ <sem>* tel
    <type>      ::= type <id> size <int>
    <fun>       ::= fun <id> ( <list(<arg>)> ) -> ( <list(<arg>)> ) <alloc>?
    <arg>       ::= state? <tid>
    <var>       ::= input? <id> : <tid> <varalloc>?
    <tid>       ::= bool | int | <id> | event
    <ceq>       ::= <guard>? <wait>? <done>? <eq>
    <eq>        ::= ( <nvlist(<lval>)> ) = ( <nvlist(<sexpr>)> )
                  | ( <list(<lval>)> ) = <fid>( <list(<sexpr>)> )
                  | ( <id> ) = <k> fby <sexpr>
                  | ( <id> ) = <sexpr> when <id>
                  | ( <id> ) = merge <id> <sexpr> <sexpr>
    <sexpr>     ::= <k> | <id>
    <k>         ::= <lit> | <fid>()
    <lit>       ::= true | false | <int> | top
    <lval>      ::= <id> | _
    <list(X)>   ::=  | <nvlist(X)>                    (list meta-rule)
    <nvlist(X)> ::= X | <nvlist(X)> , X               (non-void list meta-rule)
    <wait>      ::= wait ( <list(<id>)> )
    <done>      ::= done ( <list(<id>)> )
    <guard>     ::= top | <id> | <guard> <wait>? <done>? on <id>
    <alloc>     ::= at <addr>
    <varalloc>  ::= <alloc> <cpuid>?
    <thread>    ::= thread on <cpuid> <alloc> stack <addr> <teq>*
    <teq>       ::= <ceq> | <capi>
    <capi>      ::= <guard> <wait>? <done>? <api>
    <api>       ::= [wait:<id>] ( <lval> ) = ( <id> )
                  | [signal:<id>] ( <id> ) = ( <sexpr> )
                  | [dinval:<addr>] ( <nvlist(<id>)> ) = ( <nvlist(<id>)> )
                  | [dflush:<addr>] ( <nvlist(<id>)> ) = ( <nvlist(<id>)> )
    <sem>       ::= semaphore <id> ( <nvlist(<id>)> ) ( <nvlist(<id>)> )
                    on <lockid> <eq>+

Figure 2: Lustre language subset (in black). InteLus extensions (blue). Extension with mapping
information (red).


Inria 









Furthermore, InteLus assumes that all sequential tasks have already been built, taking the
form of sequential functions^4 called from the InteLus program (like Prod and Cons in Fig. 1(a,b)).
All memory used by the functions must be exposed to the memory allocation algorithms. For this
reason, functions cannot contain static variables. To allow the representation of stateful tasks, the
state must be exposed as one or more dataflow variables transmitted from one cycle to the next
using fby constructs. To allow efficient memory allocation of such variables, function inputs and
outputs may be tagged with the state keyword, to identify inputs that are passed by
reference, modified in place and passed on as outputs. Inputs and outputs marked with state must be
paired and placed at the beginning of the input and output argument lists. The following function
declaration features one state variable:

    fun myfun (state int, int)-> (state int, bool)
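
As an illustration of the state convention, the following sketch shows one way such a declaration could map to a C function. It is a guess at a plausible calling convention, not a signature defined by the paper: it assumes that arguments are passed by pointer, as the calls Prod(&x) and Cons(&x_cpu1) of Fig. 1(c) suggest, and that the paired state input and output share a single memory cell. The function body is purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical C counterpart of
 *     fun myfun (state int, int)-> (state int, bool)
 * The paired state input and output are one cell, updated in place.
 * The function keeps no static variable: all memory it uses is visible
 * to the caller, and hence to the memory allocation step. */
static void myfun(int *state, const int *in, bool *out) {
    assert(state != NULL && in != NULL && out != NULL);
    *state += *in;          /* illustrative body: running sum      */
    *out = (*state > 10);   /* illustrative output: threshold flag */
}
```

Under this reading, the caller owns the state cell and hands it to myfun at every cycle, which plays the role of the fby equation carrying the state from one cycle to the next without any copy.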

InteLus incorporates the event type of Signal [?], with the difference that its only value
is here top, and not true. Variables of this type carry no information, representing pure
synchronization. In Fig. 1(b) we use them to provide the dataflow interpretation of semaphore
operations. We also use them to define control dependencies between equations that do not
exchange data. This is done through the use of the novel wait and done constructs. When placed
in front of an equation, wait(s_1, ..., s_k) will delay the start of the equation until a top value (a
token) can be read on each of the event type variables s_1, ..., s_k. When placed in front of an
equation, done(s_1, ..., s_k) will write top on each of the variables s_1, ..., s_k after the completion
of the execution of the equation.

Sequencing dataflow equations to build threads running in an asynchronous environment
requires a normalization phase. For each equation of a thread, this phase builds a guard, which
is the (possibly empty) cascade of tests needed to determine, at each execution cycle where the
control reaches the equation, if the equation is executed or not, so that control can then be passed
in sequence^5. Guards are placed in front of equations. A guard always starts with a variable of type
event (or the literal top) which identifies the trigger event of the cascade of tests. We require
that all equations of a thread have the same trigger. It represents the event triggering iterations
of the thread body. When the trigger is top, no external event is needed to trigger iterations,
and the thread body is an infinite loop (like in our example). Triggers different from top allow
the representation of interrupt-driven tasks (not present in our examples).

In each guard, the trigger is followed by a sequence of tests and synchronizations. The
test of Boolean variable C is represented with “on C”. The constructs wait and done allow
the definition of control dependencies at all levels of guard decoding. Consider the following
equation:

    u on x wait(a) done(b) on y (z) = f(t)

At each cycle where u is present and x=true, a top value is waited for on variable a after the
test on x was performed and before the test on y is executed. During the same cycles, a top
value is written on variable b when the execution of the equation is completed, if y=true, or
just after the test on y, if y=false. Note how, unlike previous uses of guards in synchronous
languages [?], our guards focus on synchronization, clearly representing the flow of control from
trigger to cascade of tests and synchronizations, until passing control in sequence.
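
The control flow described for this guarded equation can be sketched as sequential C. This is only an illustration of the ordering of tests and synchronizations stated in the paragraph above, under several assumptions: event variables are modeled as token counters, a blocking wait is replaced by an assertion that a token is available, and f is a hypothetical function. The names event_t, ev_wait, ev_done and guarded_step are introduced here and do not come from the paper.

```c
#include <assert.h>
#include <stdbool.h>

/* An event variable, modeled as the number of pending `top` tokens. */
typedef struct { int tokens; } event_t;

/* Sequential stand-in for wait: a real implementation would block. */
static void ev_wait(event_t *e) { assert(e->tokens > 0); e->tokens--; }

/* Stand-in for done: writes one `top` token. */
static void ev_done(event_t *e) { e->tokens++; }

/* Hypothetical function f of the example equation. */
static int f(int t) { return 2 * t; }

/* One activation of   u on x wait(a) done(b) on y (z) = f(t)
 * at a cycle where the trigger u is present.
 * Returns true iff the equation itself was executed. */
static bool guarded_step(bool x, bool y, int t,
                         event_t *a, event_t *b, int *z) {
    if (!x) return false;   /* test on x fails: nothing else happens  */
    ev_wait(a);             /* after the test on x, before that on y  */
    bool fired = false;
    if (y) {                /* innermost test of the guard            */
        *z = f(t);
        fired = true;
    }
    ev_done(b);             /* after the equation if y, else right
                               after the test on y                    */
    return fired;
}
```

In all three cases control returns to the caller, which matches the idea that a guard either runs the equation or decides not to, and then passes control in sequence.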

An InteLus program featuring non-trivial guards is provided in Fig. 3 (the black, dataflow
part). This program is an optimized, pipelined implementation of the producer-consumer
specification of Fig. 1(a). It contains several features not present in the non-optimized program
of Fig. 1(b). The Boolean variable c is tested by guards to allow the incremental activation
of equations during the prologue of the pipelined implementation. Software pipelining requires
some memory replication, to allow Prod and Cons to work in parallel on different copies of the
specification variable x. The copy used by Prod is located at address 0x20500, and that used by
Cons at address 0x30500. The copy from one location to another is realized at each cycle by the
equation in line 20.

4 Compiled from traditional Lustre nodes, or other C functions.

5 The decision not to execute is particularly valuable in an asynchronous environment where absence cannot
be detected.

     1  fun Prod:()->(int) at 0x20100
     2  fun Cons:(int)->() at 0x30200
     3
     4  var x:int at 0x20500 on cpu0 x1:int at 0x30500 on cpu0
     5      x1_ram:int at 0x30500 x1_ram_fby:int at 0x30500
     6      x2:int at 0x30500 on cpu1 x2_ram:int at 0x30500
     7      c:bool at 0x30504 on cpu1
     8
     9      i:event j:event k:event l:event m:event n:event o:event p:event
    10      q:event v:event v0:event v1:event u:event u1:event
    11
    12  let
    13    (l) = top fby k
    14    (q) = top fby p
    15    (x1_ram_fby) = 0 fby x1_ram
    16    (x2_ram) = x1_ram_fby when c
    17  thread on cpu0 at 0x20000 stack 0x30000
    18    top wait(l)                                (x) = Prod()
    19    top done(i)              [wait:s1]         (_) = (u1)
    20    top wait(i)                                (x1) = (x)
    21    top done(j)              [flush:0x20504]   (x1_ram) = (x1)
    22    top wait(j) done(k)      [signal:s0]       (v) = top
    23  thread on cpu1 at 0x30000 stack 0x40000
    24    top wait(q) done(m) on c [wait:s0]         (_) = (v1)
    25    top wait(m) on c         [inval:0x20504]   (x2) = (x2_ram)
    26    top done(n) on c                           () = Cons(x2)
    27    top wait(n) done(o)      [signal:s1]       (u) = top
    28    top wait(o) done(p)                        (c) = false fby true
    29  semaphore s0 (v) (v1) init false on lock_0
    30    (v0) = top fby v
    31    (v1) = v0 when c
    32  semaphore s1 (u) (u1) init true on lock_1
    33    (u1) = top fby u
    34  tel

Figure 3: Efficient pipelined implementation of our example

Fig. 3 also features equations not assigned to threads or semaphores. These are equations
that are needed for the completion of the dataflow model, but whose semantic action will either
be implemented by the sequencing of threads (for lines 13 and 14) or will not require
encoding with platform operations. For instance, the equations in lines 15 and
16 need no platform operation due to the allocation of input and output variables on the same
memory locations.

Finally, the notation _ is used as an lvalue for equation outputs of type event whose value is not
needed. This notation reduces the number of variables.

2.1 Synchronous semantics principles 

The semantics of InteLus is close to that of Lustre, which is well covered in existing literature [3]
[?]. For this reason, and due to space restrictions, we will provide the full synchronous semantics
of InteLus in a separate report. We provide here only the principles of the formalization, used
next.

The semantics of InteLus statements and programs is described through structural operational
semantics (SOS) rules used to derive transitions of the form
$p' \overset{O}{\underset{I}{\Longrightarrow}} p''$. Here, $p'$ and $p''$ are terms
describing the input and output state of some statement or program $p$, $I$ is a valuation of all
input variables of the statement/program, and $O$ is a valuation of all variables that are not
inputs. The transition defines the behavior of $p$ for one execution cycle.

State terms associated with a statement or program $p$ describe the values currently stored by
fby equations. They do so by modifying the constants of fby equations directly in the text of $p$.
The initial state is $p$ (with the initial fby constants). Values assigned by $I$ and $O$ to variables
are either of the type of the variable, or can be the special value nil representing the absence of
a value. If $V$ is a set of variables, we denote with $\mathcal{R}_V$ the set of all possible assignments of the
variables in $V$. For $r \in \mathcal{R}_V$ and $v \in V$ we denote with
$r(v) \in Type(v) \cup \{nil\}$ the value of $v$ in $r$.

Given a program $p$, an execution trace is any sequence of transitions starting in the initial
state:
$p \overset{O_1}{\underset{I_1}{\Longrightarrow}} p_1 \overset{O_2}{\underset{I_2}{\Longrightarrow}} \cdots \overset{O_n}{\underset{I_n}{\Longrightarrow}} p_n$.
We say that a program is correct if a trace exists for every sequence
$I_1, \ldots, I_n$ assigning values different from nil to every input variable. We denote the set of
all traces of program $p$ with $Traces(p)$. Given the determinism of the synchronous semantics,
$Traces(p)$ can be seen as a subset of $\mathcal{R}_V^*$, where $V$ is the set of variables of $p$.

The full semantics is described in Appendix A.

2.2 Kahnian asynchronous interpretation 

The determinism of the InteLus dataflow equations means that we can endow InteLus programs
with asynchronous Kahn process networks (KPN) [7] semantics without losing determinism.
This asynchronous interpretation is important in our case, given the objective of asynchronous
multi-processor implementation. It highlights the well-formedness properties allowing
implementation on the target execution platforms. A strong semantics preservation property links
the asynchronous interpretation of an InteLus program to its synchronous semantics, and its
small-step operational description facilitates the link with the platform semantics of Section 3.3.
This facilitates the formal characterization of implementation correctness.

2.2.1 Preliminary notations 

In our semantics, program equations are deterministic Kahn processes communicating through
infinite lossless FIFO channels corresponding to the variables. The semantics is asynchronous, so
that absence of a variable cannot be reacted upon (only presence). Execution traces are
represented with histories assigning sequences of values different from nil to each variable identifier.
Given a set of variables $V$, we denote with $Hist(V)$ the set of finite histories $h$ assigning to
each $v \in V$ a sequence of values $h(v) \in Type(v)^*$. On $Hist(V)$ we introduce the concatenation
operation, defined pointwise by $(h_1 h_2)(v) = h_1(v)\,h_2(v)$ for all $v$. Operator $len(x)$
provides the length of a string.

\[
\frac{\forall i \in 1..m : adv(h, Y_i, k_i) \qquad (x_1, \ldots, x_n) = fid(h_{k_1}(Y_1), \ldots, h_{k_m}(Y_m))}
     {\ll (X_1, \ldots, X_n) = fid(Y_1^{k_1}, \ldots, Y_m^{k_m}),\ h \gg \ \mapsto\ \ll (X_1, \ldots, X_n) = fid(Y_1^{k_1+1}, \ldots, Y_m^{k_m+1}),\ h\,\delta(\{X_i \mapsto x_i \mid i = 1..n\}) \gg}
\quad \text{(fcall)}
\]

\[
\frac{adv(h, Y, n)}
     {\ll X = k\ \text{fby}\ Y^n,\ h \gg \ \mapsto\ \ll X = h_n(Y)\ \text{fby}\ Y^{n+1},\ h\,\delta(\{X \mapsto h_n(Y)\}) \gg}
\quad \text{(fby)}
\]

\[
\frac{adv(h, C, m) \qquad h_m(C) = false \qquad adv(h, X, m)}
     {\ll Y = X^m\ \text{when}\ C^m,\ h \gg \ \mapsto\ \ll Y = X^{m+1}\ \text{when}\ C^{m+1},\ h \gg}
\quad \text{(when-)}
\]

\[
\frac{adv(h, C, m) \qquad h_m(C) = true \qquad adv(h, X, m)}
     {\ll Y = X^m\ \text{when}\ C^m,\ h \gg \ \mapsto\ \ll Y = X^{m+1}\ \text{when}\ C^{m+1},\ h\,\delta(\{Y \mapsto h_m(X)\}) \gg}
\quad \text{(when+)}
\]

\[
\frac{adv(h, C, m) \qquad h_m(C) = true \qquad adv(h, X, n)}
     {\ll Z = \text{merge}\ C^m\ X^n\ Y^p,\ h \gg \ \mapsto\ \ll Z = \text{merge}\ C^{m+1}\ X^{n+1}\ Y^p,\ h\,\delta(\{Z \mapsto h_n(X)\}) \gg}
\quad \text{(merge+)}
\]

\[
\frac{adv(h, C, m) \qquad h_m(C) = false \qquad adv(h, Y, p)}
     {\ll Z = \text{merge}\ C^m\ X^n\ Y^p,\ h \gg \ \mapsto\ \ll Z = \text{merge}\ C^{m+1}\ X^n\ Y^{p+1},\ h\,\delta(\{Z \mapsto h_p(Y)\}) \gg}
\quad \text{(merge-)}
\]

\[
\frac{\ll p, h \gg \ \mapsto\ \ll p', h\,\delta(x) \gg \qquad \forall i : adv(h, S_i, k) \qquad \forall i, j : (x(S_i) = x(T_j) = nil) \wedge (S_i \neq T_j)}
     {\ll r(p, k),\ h \gg \ \mapsto\ \ll r(p', k+1),\ h\,\delta(x \cup \{T_i \mapsto top \mid i = 1..n\}) \gg}
\quad \text{(wait-done)}
\]

\[
\frac{adv(h, C, k) \qquad h_k(C) = true \qquad \ll eq, h \gg \ \mapsto\ \ll eq', h\,\delta(r) \gg}
     {\ll \text{on}\ C^k\ eq,\ h \gg \ \mapsto\ \ll \text{on}\ C^{k+1}\ eq',\ h\,\delta(r) \gg}
\quad \text{(on+)}
\]

\[
\frac{adv(h, C, k) \qquad h_k(C) = false}
     {\ll \text{on}\ C^k\ eq,\ h \gg \ \mapsto\ \ll \text{on}\ C^{k+1}\ eq,\ h \gg}
\quad \text{(on-)}
\]

\[
\frac{adv(h, S, k) \qquad \ll eq, h \gg \ \mapsto\ \ll eq', h' \gg}
     {\ll S^k\ eq,\ h \gg \ \mapsto\ \ll S^{k+1}\ eq',\ h' \gg}
\quad \text{(grd)}
\]

\[
\frac{\ll eq_i, h \gg \ \mapsto\ \ll eq_i', h' \gg}
     {\ll eq_1, \ldots, eq_i, \ldots, eq_n,\ h \gg \ \mapsto\ \ll eq_1, \ldots, eq_i', \ldots, eq_n,\ h' \gg}
\quad \text{(interleave)}
\]

Figure 4: Kahn process network semantics. Predicate $adv(h, X, k)$ is defined as $len(h(X)) > k$.
Term $r(p, k)$ is defined as wait$(S_1^k, \ldots, S_m^k)$ done$(T_1, \ldots, T_n)$ $p$.

The asynchronous observation of a synchronous trace discards synchronization, retaining for
each variable the sequence of values different from nil. It is defined as
$\delta : \mathcal{R}_V^* \to Hist(V)$, where $\delta(t_1 t_2) = \delta(t_1)\,\delta(t_2)$ for all
$t_1, t_2 \in \mathcal{R}_V^*$, and $\delta(r)(v) = r(v)$ if $r \in \mathcal{R}_V$ with $r(v) \neq nil$, and
$\delta(r)(v) = \epsilon$ (the empty word) if $r(v) = nil$.
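
For a single integer variable, the observation function can be sketched as a filter over the per-cycle values of a synchronous trace. The encoding below is an assumption made for illustration: nil is represented by a presence flag, and the names sync_val_t and observe do not come from the paper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value of one variable at one synchronous cycle.
 * present == false encodes the special value nil. */
typedef struct { bool present; int value; } sync_val_t;

/* Asynchronous observation of one variable over a trace of n cycles:
 * copies the non-nil values, in order, into out (capacity >= n) and
 * returns the length of the resulting history. */
static size_t observe(const sync_val_t *trace, size_t n, int *out) {
    assert(trace != NULL && out != NULL);
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (trace[i].present) {
            out[len++] = trace[i].value;  /* keep the value, drop its cycle */
        }
    }
    return len;
}
```

Observing a trace in two pieces and appending the results gives the same history as observing it in one pass, which is the concatenation property stated for the observation function.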


2.2.2 Program state 

In the classical definition of Kahn process networks [7], system state is fully determined by
the state of the processes and the state of the communication channels. The state of each
communication channel can be represented with the sequence of messages produced, but not yet
consumed on that channel.

In InteLus, under asynchronous semantics, equations are stateless. This is true even for the 
fby equation, whose internal state is naturally conflated with that of the output variable. On 
the other hand, point-to-point channels are replaced with multi-cast variables that can be read 
by multiple equations. For this reason, we represent states of a program or statement p as pairs
$\ll p', h \gg$ where:

• $h \in Hist(V)$, where V is the set of variables of p. It gives the finite (possibly empty)
sequence of values different from nil that were assigned to each variable since the beginning
of the execution. If v is the output of a fby equation, h(v) also contains, on the last position,
the internal state of the fby (its constant).

• p' is an annotated term over p. The terms are based on those of the synchronous semantics. 
In addition, for each equation eq that reads a variable v, the use of v inside the equation is
annotated with a non-negative integer defining the position in h(v) that the equation will 
read at its next execution. Indices in h(v) start at 0. 

Note that there is some redundancy between the fby constants in the term and the same values
stored in the history of the fby output variables. We keep both for consistency with the
synchronous semantics. The initial state of a program/statement p is $\ll p_0, h_0 \gg$, where
$p_0$ is obtained from p by annotating each variable read point with 0. If v is defined by equation
v = k fby y then $h_0(v) = k$. For all other variables $h_0(v) = \epsilon$.

We say that a state $\ll p, h \gg$ is complete if for every variable v defined by v = k fby y
all read pointers are equal to len(h(v)) - 1, and for every other variable w the read pointers are
equal to len(h(w)). Complete states can be put in direct relation with states of the synchronous
semantics. We denote with $strip(\ll p, h \gg)$ the synchronous state term obtained from
$\ll p, h \gg$ by removing the history h and the read point annotations.
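The completeness condition can be checked mechanically once a state is summarized by the history length and the read pointers of each variable. The sketch below is only an illustration: the record `var_info` and the function names are hypothetical and not part of InteLus.

```ocaml
(* Assumed summary of one variable in a state <<p, h>>:
   whether it is defined by a fby equation, len(h(v)), and its read pointers. *)
type var_info = { is_fby_output : bool; len : int; read_ptrs : int list }

(* A fby output keeps the fby state in the last position of h(v), so its
   readers must point at len - 1; readers of other variables point at len. *)
let complete_var (v : var_info) : bool =
  let target = if v.is_fby_output then v.len - 1 else v.len in
  List.for_all (fun x -> x = target) v.read_ptrs

(* A state is complete when every variable satisfies the condition. *)
let complete (vars : var_info list) : bool = List.for_all complete_var vars
```

Under this encoding, the initial state (read pointers at 0, fby outputs holding only their constant) is complete.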

2.2.3 Semantic rules and semantics preservation 

Semantics is presented in Fig. 4 under SOS form. Concurrency is by interleaving. At each
transition, exactly one equation is executed among those that can be executed (cf. rule
(interleave)). The rules for guards (on*) and (grd) are optional. Without them, the semantics works
for programs without guards. With the rules, equations can have guards. Input variables are 
not considered in the semantics. It is assumed that each input is read by exactly one equation, 
and that a fresh value is provided at each execution of that equation. 

Predicate adv(h,v,n) determines whether we can read the element at position n of h(v) in history h.
It is used to determine if enough input is available to enable the execution of a transition. 
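As an illustration of how adv gates a read, here is a minimal OCaml sketch, assuming histories are encoded as association lists from variable names to integer sequences (an encoding chosen only for this example; `seq_of` and `read` are hypothetical helpers).

```ocaml
type history = (string * int list) list

(* h(v), with the empty sequence for variables without an entry. *)
let seq_of (h : history) (v : string) : int list =
  match List.assoc_opt v h with Some s -> s | None -> []

(* adv(h, v, n) holds iff len(h(v)) > n, i.e. position n (0-based) exists. *)
let adv (h : history) (v : string) (n : int) : bool =
  List.length (seq_of h v) > n

(* Read position n of h(v) only when it is available; n is assumed to be
   a non-negative read pointer. *)
let read (h : history) (v : string) (n : int) : int option =
  if n >= 0 && adv h v n then Some (List.nth (seq_of h v) n) else None
```

A transition that needs position n of a variable is enabled only when `adv` returns true; otherwise the equation simply cannot fire yet.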

The synchronous and Kahnian asynchronous semantics of InteLus are tightly related by the 
following result: 

Theorem 1 (Semantics preservation) Let p be a correct synchronous program. Then:

• For any $r_1 \ldots r_k \in Traces(p)$ there exists a complete asynchronous trace
$\ll p_0, h_0 \gg \to \cdots \to \ll p_n, h_n \gg$ such that:

$$h_n = \delta(r_1 \ldots r_k)\,\delta(r) \qquad (1)$$

where $r(v) = \epsilon$ if v is not the output of a fby, and r(v) = k if k is the state of the
fby defining v in $p_n$. When this happens, the final state of the synchronous trace is
$strip(\ll p_n, h_n \gg)$.

• For any asynchronous trace $\ll p_1, h_1 \gg \to \cdots \to \ll p_m, h_m \gg$ of p there exists an
extension $\ll p_1, h_1 \gg \to \cdots \to \ll p_n, h_n \gg$ (n > m) that is complete, and there exists
$r_1 \ldots r_k \in Traces(p)$ satisfying property (1) above.

Proof sketch: Direct application of Theorem 4.4 of m- The tedious part of the proof consists in 
interpreting equations of p as micro-step state transition systems (μSTS) and proving that the
synchronous (resp. asynchronous) composition of the μSTSs associated to equations faithfully
represents the synchronous (resp. asynchronous) semantics of p. Once this is done, Theorem 4.4
can be applied, after noting that the synchronous correctness of the program ensures that the
asynchronous composition of μSTSs is non-blocking. ■

2.2.4 Properties ensuring implementability

The Kahnian semantics is the natural semantic framework for defining a series of correctness 
properties ensuring the implementability of our programs. We briefly present here only 3 such 
properties. These properties are amenable to low-complexity verification and synthesis through 
the use of sufficient properties defined in the synchronous model. 

Boundedness. When the program of Fig. B a ) is executed under Kahnian semantics, the 
producer may be executed an indefinite number of times before the consumer is executed once. 
Such a behavior requires infinite storage, and is not amenable to static resource allocation. It 
must be prohibited through the introduction of wait/done dependencies. Our objective is that 
implementations require the storage of at most one value at a time for each variable. Semantically, 
this amounts to requiring that the dataflow part of the implementation program satisfies the 
following property: 

Definition 1 (1-boundedness) An InteLus program p is called 1-bounded if in any reachable
state $\ll p', h' \gg$ of the Kahnian semantics, for every variable v that is not an input and for
every read point annotation x of v, we have: $len(h'(v)) - x \leq 1$.
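The bound can be checked state by state. The sketch below reads the condition as len(h'(v)) - x ≤ 1, i.e. at most one value is produced but not yet consumed by any reader, which matches the stated objective of storing at most one value per variable. It inspects a single state only; establishing 1-boundedness requires the check to hold in every reachable state. The record and function names are hypothetical.

```ocaml
(* Assumed summary of a non-input variable in a state: len(h'(v)) and the
   read point annotations of the equations that read it. *)
type var_state = { len : int; read_ptrs : int list }

(* Each reader may lag behind the producer by at most one value. *)
let one_bounded_var (s : var_state) : bool =
  List.for_all (fun x -> s.len - x <= 1) s.read_ptrs

(* The per-state check over all non-input variables. *)
let one_bounded_state (vars : (string * var_state) list) : bool =
  List.for_all (fun (_, s) -> one_bounded_var s) vars
```

A producer that runs twice before its consumer runs once leaves a reader two positions behind, and the check fails.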

Implementation of fby with no internal memory. The general translation of fby into C 
code requires the use of internal storage. However, all memory must be exposed to optimization, 
so we always impose scheduling constraints (through wait/done dependencies) ensuring that 
internal storage is not needed.⁶

Explicit synchronization. Under dataflow semantics, data accesses are synchronizing. Under
machine semantics, they are not. Assume that a variable v is used for communication between
equations $eq_1$ and $eq_2$ that will be mapped on different threads. Then, variables of type event,
later mapped onto semaphore operations, must be used to represent the synchronization
associated to these data accesses. The completeness of the synchronization can be checked on the
dataflow implementation model. Indeed, if enough synchronization has been added, then the
Kahnian semantic transitions of $eq_2$ do not need to perform the synchronization tests on v (the
adv conditions in Fig. 4).


3 Implementation modeling 

This section starts with a presentation of the architectures we target, including a precise
description of their API and of the desired implementation structure. Based on this description,
we can introduce the implementation model: its syntax, and its intuitive and formal semantics.

3.1 Target architectures 

For the scope of this paper, we shall consider multi-core shared memory architectures without
shared caches and without hardware cache coherency between L1 caches. This definition covers
classical multi-core processors when their shared caches are disabled (often the case in embedded
contexts), or parts of larger architectures, such as the computing tiles of the Kalray MPPA 256
many-core,⁷ which we use as test platform. Due to presentation size restrictions, we do not cover
here issues related to the interaction between the shared memory (sub-)system we consider and
its environment (communication buses, on-chip networks, DMA units, etc.), or to the use of a
memory management unit (MMU). We assume that our platforms have a single address space
with physical addressing and no protection.

6 Even stricter rules can be imposed, to reduce memory consumption or enforce internal coding rules. In our
avionics case study we ensure that the input and output variables of a fby can share the same memory location,
which also means that no code is needed in the implementation for fby equations. Such stricter rules must be
used with care, because they can introduce deadlocks.

3.1.1 API and ABI 

We assume that our target architectures provide a minimal set of services, and we standardize 
access to them by means of a simple API. Similarly, we make a number of assumptions on the 
way software is deployed. These assumptions can be seen as an extended application binary 
interface (ABI). The definition of the semantics of our implementations heavily relies on the API 
and ABI defined next. Modifications on any of the two may require changes to the operational 
semantics defined next. 

Cache consistency. Software running on one CPU core can enforce consistency between its L1
data cache and the shared RAM using 2 primitives. Primitive inval(addr) invalidates the cache
line containing address addr. The next access to an address in this cache line is guaranteed to
be a cache miss, which forces loading from the shared RAM. Primitive flush(addr_1, ..., addr_k)
forces the writing of the words at addresses addr_1, ..., addr_k to the shared RAM, and enforces a
memory barrier afterwards. When the execution of flush completes, all the words have already
been written to shared RAM.

Semaphores. Synchronization between threads is done using binary semaphores. A fixed,
finite set of semaphores is provided. The state of all semaphores is initially false.
Synchronization is done using 2 primitives. Primitive signal(l) should only be called when the
state of semaphore l is false (otherwise, the behavior is undefined). It changes the state of l to
true. Primitive wait(l) waits until the state of semaphore l is true and changes the state to false.
When multiple wait(l) statements are waiting when signal(l) sets the semaphore to true,
only one of them is non-deterministically chosen and executed. Our implementations will ensure
that no concurrency exists between wait(l) calls.

ABI. We shall consider in this paper only statically scheduled, bare-metal implementations.
Like in our example of Fig. 1, each CPU is assigned one sequential thread, a function that never
terminates. An initialization function is called by one of the threads when execution starts. A
global synchronization barrier (function global_barrier) separates initialization from the rest
of the execution. We assume the software is already deployed, meaning that we do not cover
here boot-related issues. Each thread is loaded at a fixed address along with its local data (if
any). The same is true for functions called by threads. For each thread, the stack base is
statically defined.

3.2 Mapping information 

This section provides the description of the red syntax elements of Figures m and [3] which 
define the mapping of the data-flow variables and equations onto resources of the target platform. 
Mapping annotations have 3 functions. First, they partition the data-flow equations in 3 
classes: equations sequenced into threads, equations describing the behavior of platform devices 
(in our case semaphores), and equations that do not need sequencing into threads due to specific 

7 www.kalray.eu/kalray/products/#processors 
$$\frac{L([s]) = true}{\bullet\,[wait : [s]]\,eq,\ M,\ L\ \to_p\ [wait : [s]]\,eq\,\bullet,\ M,\ wait(L,[s])}\ \text{(wait)}$$

$$\frac{L([s]) = false}{\bullet\,[signal : [s]]\,eq,\ M,\ L\ \to_p\ [signal : [s]]\,eq\,\bullet,\ M,\ signal(L,[s])}\ \text{(sig)}$$

$$\frac{L([s]) = true}{\bullet\,[signal : [s]]\,eq,\ M,\ L\ \to_p\ Err}\ \text{(sig-err)}$$

$$\bullet\,[dinval : adr]\,eq,\ M,\ L\ \to_p\ [dinval : adr]\,eq\,\bullet,\ dinval(M,p,adr),\ L \quad \text{(dinval)}$$

$$\bullet\,[dflush : adr]\,eq,\ M,\ L\ \to_p\ [dflush : adr]\,eq\,\bullet,\ dflush(M,p,adr),\ L \quad \text{(dflush)}$$

$$\frac{M([C]) = (R\,(\_, (D\ true))) \quad M([Y]) = (R\,(\_, (D\ y))) \quad M([X]) = (R\,(0, \_))}{\bullet\,(X) = Y\ when\ C,\ M,\ L\ \to_p\ (X) = \bullet^{X=y}\ Y\ when\ C,\ enter(M,p,\{X\},\{C,Y\}),\ L}\ \text{(when+)}$$

$$\frac{M([C]) = (R\,(\_, (D\ false)))}{\bullet\,(X) = Y\ when\ C,\ M,\ L\ \to_p\ (X) = \bullet\ Y\ when\ C,\ enter(M,p,\{\},\{C\}),\ L}\ \text{(when-)}$$

$$\frac{\text{conditions of (when+) or (when-) not satisfied}}{\bullet\,(X) = Y\ when\ C,\ M,\ L\ \to_p\ Err}\ \text{(when-err)}$$

$$\frac{M([C]) = (R\,(\_, (D\ true))) \quad M([Y_1]) = (R\,(\_, (D\ y))) \quad M([X]) = (R\,(0, \_))}{\bullet\,(X) = merge\ C\ Y_1\ Y_2,\ M,\ L\ \to_p\ (X) = \bullet^{X=y}\ merge\ C\ Y_1\ Y_2,\ enter(M,p,\{X\},\{C,Y_1\}),\ L}\ \text{(merge+)}$$

$$\frac{M([C]) = (R\,(\_, (D\ false))) \quad M([Y_2]) = (R\,(\_, (D\ y))) \quad M([X]) = (R\,(0, \_))}{\bullet\,(X) = merge\ C\ Y_1\ Y_2,\ M,\ L\ \to_p\ (X) = \bullet^{X=y}\ merge\ C\ Y_1\ Y_2,\ enter(M,p,\{X\},\{C,Y_2\}),\ L}\ \text{(merge-)}$$

$$\frac{\text{conditions of (merge+) or (merge-) not satisfied}}{\bullet\,(X) = merge\ C\ Y_1\ Y_2,\ M,\ L\ \to_p\ Err}\ \text{(merge-err)}$$

$$\frac{\forall j : M([Y_j]) = (R\,(\_, (D\ y_j))) \quad \forall i : M([X_i]) = (R\,(0, \_)) \quad (x_1, \ldots, x_n) = fid(y_1, \ldots, y_m)}{\bullet\,(X_1, \ldots, X_n) = fid(Y_1, \ldots, Y_m),\ M,\ L\ \to_p\ (X_1, \ldots, X_n) = \bullet^{X_1=x_1, \ldots, X_n=x_n}\ fid(Y_1, \ldots, Y_m),\ enter(M,p,\{X_1, \ldots, X_n\},\{Y_1, \ldots, Y_m\}),\ L}\ \text{(fcall)}$$

$$\frac{\text{conditions of (fcall) not satisfied}}{\bullet\,(X_1, \ldots, X_n) = fid(Y_1, \ldots, Y_m),\ M,\ L\ \to_p\ Err}\ \text{(fcall-err)}$$

$$\frac{M([Y]) = (R\,(\_, (D\ y))) \quad M([X]) = (R\,(0, \_)) \quad k \neq nil \quad [X] \neq [Y]}{\bullet\,(X) = k\ fby\ Y,\ M,\ L\ \to_p\ (X) = \bullet^{X=k}\ y\ fby\ Y,\ enter(M,p,\{X\},\{Y\}),\ L}\ \text{(fby)}$$

$$\frac{\text{conditions of (fby) not satisfied}}{\bullet\,(X) = k\ fby\ Y,\ M,\ L\ \to_p\ Err}\ \text{(fby-err)}$$

$$(X_1, \ldots, X_n) = \bullet^{X_1=x_1, \ldots, X_n=x_n}\ expr,\ M,\ L\ \to_p\ (X_1, \ldots, X_n) = expr\,\bullet,\ exit(M,p,\{X_1, \ldots, X_n\},\{Y_1, \ldots, Y_m\}),\ L \quad \text{(eq-exit)}$$

Figure 5: Platform semantics, part 1. Rules for API call equations (top), rules for dataflow
equations (bottom). Error rules in gray.
$$\frac{M([C]) = (R\,(\_, l)) \text{ with } l.cache(p) = (D\ true)}{\bullet\,on\ C\ aeq,\ M,\ L\ \to_p\ on\ C\ \bullet\ aeq,\ M,\ L}\ \text{(on+)}$$

$$\frac{M([C]) = (R\,(\_, l)) \text{ with } l.cache(p) = (D\ false)}{\bullet\,on\ C\ aeq,\ M,\ L\ \to_p\ on\ C\ aeq\,\bullet,\ M,\ L}\ \text{(on-)}$$

$$\frac{\text{conditions of (on+), (on-) not satisfied}}{\bullet\,on\ C\ aeq,\ M,\ L\ \to_p\ Err}\ \text{(on-err)}$$

$$\frac{aeq,\ M,\ L\ \to_p\ aeq',\ M',\ L'}{on\ C\ aeq,\ M,\ L\ \to_p\ on\ C\ aeq',\ M',\ L'}\ \text{(on-comp)} \qquad \frac{aeq,\ M,\ L\ \to_p\ Err}{on\ C\ aeq,\ M,\ L\ \to_p\ Err}\ \text{(on-comp-err)}$$

$$\frac{aeq,\ M,\ L\ \to_p\ aeq',\ M',\ L'}{top\ aeq,\ M,\ L\ \to_p\ top\ aeq',\ M',\ L'}\ \text{(grd-comp)} \qquad \frac{aeq,\ M,\ L\ \to_p\ Err}{top\ aeq,\ M,\ L\ \to_p\ Err}\ \text{(grd-comp-err)}$$

$$\bullet\,top\ aeq,\ M,\ L\ \to_p\ top\ \bullet\ aeq,\ M,\ L \quad \text{(grd)}$$

$$thread\ eq_1 \ldots eq_k\,\bullet,\ M,\ L\ \to_p\ thread\ \bullet\ eq_1 \ldots eq_k,\ M,\ L \quad \text{(seq-loop)}$$

$$\frac{eq_i,\ M,\ L\ \to_p\ eq_i',\ M',\ L'}{thread\ eq_1 \ldots eq_i \ldots eq_k,\ M,\ L\ \to_p\ thread\ eq_1 \ldots eq_i' \ldots eq_k,\ M',\ L'}\ \text{(seq)} \qquad \frac{eq_i,\ M,\ L\ \to_p\ Err}{thread\ eq_1 \ldots eq_i \ldots eq_k,\ M,\ L\ \to_p\ Err}\ \text{(seq-err)}$$

$$\frac{t_i,\ M,\ L\ \to_p\ t_i',\ M',\ L'}{t_1 \ldots t_i \ldots t_k,\ M,\ L\ \to_p\ t_1 \ldots t_i' \ldots t_k,\ M',\ L'}\ \text{(interleave)} \qquad \frac{t_i,\ M,\ L\ \to_p\ Err}{t_1 \ldots t_i \ldots t_k,\ M,\ L\ \to_p\ Err}\ \text{(interleave-err)}$$

Figure 6: Platform semantics, part 2. Rules for guards (top), rules for sequential and parallel
composition (bottom). Error rules in gray.
mapping choices. Such are the fby equations satisfying the conditions of Section 2.2.4. The
latter are always placed first in an implementation program, like that of Fig. 3. Partitioning
annotations also include mapping information such as the allocation of threads to CPUs, the
name and configuration of semaphores, etc.


Second, mapping annotations define memory allocation. In our mapping approach, these 
annotations cover functions, variables, and threads. For functions that are not inlined, we define 
the base address where both code and data are allocated. It is assumed that the entry point of 
the function is at that address. For threads, annotations define the addresses for code (and local 
data) and stack. Variables of type event are not allocated memory locations. All other variables 
are. Allocation annotations distinguish between variables viewed through different processor 
caches (which carry the on annotation) and those representing the state of the RAM. 

Third, mapping annotations provide the API implementation of various dataflow 
equations. 


The only platform-specific devices used in this paper are semaphores. They represent
token-passing protocols where the only way to create a token is by a call to signal, and the only way
to consume a token is by a call to wait (no token can be lost). It is also required that at most one 
token can traverse a semaphore at a given time. It can be verified that the dataflow equations 
of a semaphore satisfy these properties. When this happens, a platform implementation using 
instead the signal and wait API primitives will function correctly and preserve the dataflow 
semantics. The platform semantics will only consider the semaphore declaration, not the dataflow 
description of the protocol. This declaration includes its name, the event type variables that 
bring tokens into the semaphore, the event type variables that take tokens out of the semaphore, 
the initialization value of the semaphore, and the platform lock on which the semaphore is 
allocated. 


3.3 Platform semantics 

The platform semantics provides an operational description of the platform execution, once the 
application has been mapped on it and once the initialization code has been executed. It is not 
meant to be a semantics for general multi-threaded implementations. Instead, it only covers the 
very restricted control structures of the implementations we target. Furthermore, when compared 
to classical cycle-accurate simulation semantics, it already includes elements of abstraction meant 
to hide timing aspects⁸ and to isolate the semantics of the well-formed code synthesized from
the implementation model from the semantics of the potentially more general C code of external 
functions (such as Prod or Cons in our example). This isolation means that we cannot know 
the exact position of control (the program counter) during execution of a function, nor the way 
the function manipulates data during computations. For space reasons, the platform semantics 
presented here covers only dataflow programs where the guard triggers are top. 


3.3.1 State representation 

In the platform semantics, states provide an abstract view of the state of the execution platform 
components, including CPUs, memory hierarchy, and semaphore status. They have the following 
structure (for clarity, we use an OCaml syntax for its definition):


8 This form of abstraction was also employed in the concrete semantics of m- 
type mach_state = Err
                | Norm of ctrl_state * mem_state * lock_state
and mem_state = addr -> mem_loc_state
and mem_loc_state = W | R of posint * loc_value
and loc_value = { ram : value; cache : processor -> value }
and value = D of word | U
and lock_state = lock -> bool


The remainder of this section details the 3 components of the machine state and defines state 
manipulation functions. 

Control flow state. It is represented through annotation of the implementation program. The 
program counters of each CPU are represented with bullets (•) placed in the body of threads. To 
allow the identification of all possible memory access or synchronization interferences between 
threads, bullets can take the following positions: 

• Between guarded equations of the thread, to represent the state where the execution of 
one equation is finished, but the next (including possible guard tests) has not yet started. 
When control loops back after the execution of the last equation of a thread, the bullet is
first placed after the last equation, and then before the first equation of the thread.

• Before a guard test “on C”, to represent the state where the execution has reached, but not 
executed, the associated variable test. 

• After all guard and synchronization annotations, to represent the state where the guard 
has been successfully traversed, but the equation execution has not started. 

• Inside the equation, to represent the state where its execution has started, but is not
completed. When inside an equation, the bullet is annotated with the memory locations
corresponding to variables in use by the equation execution (read or written). These
annotations are added by the transition entering the equation, and removed by the equation
exit transition.

In the initial state, in each thread, the bullet is placed before the first equation. The set of all 
possible such state representations obtained by bullet annotations is denoted ctrl_state. 

Memory system state. It includes the state of the RAM and that of the L1 data caches of
the CPUs. We denote with addr the set of memory addresses that can be used by data-flow
variables. At each instant, a memory location can be either written by some equation (as an 
output variable), or read by zero or more equations, as an input variable. Performing a read 
access on a variable that is currently written by another equation, or a write access on a variable 
that is currently written or read by another equation is an error. In the definition of type 
mem_loc_state, variant W corresponds to the write state, and posint gives the non-negative 
number of readers in the read access case. 

In the case where a location is read, its value can be undefined (U) or defined (D) with a word 
value. The U value represents a state where the value of the memory location is unknown. 

In a given state, read accesses from different processors to the same memory location may 
produce different values. To allow reasoning on this memory consistency issue, we need to store, 
for each memory location, not one, but proc + 1 values, where proc is the number of CPU cores.
Type loc_value structures the storage of these proc + 1 values. The value stored in record field
ram is that stored in RAM. The value stored in cache(p) is the one that an access from
CPU core p to its L1 cache would return.
The initial memory state is determined by the initial value of fby equations of the program:
all locations are in state (R (0,v)), where v.cache(p)=U for all p. Memory locations not
corresponding to fby output variables also have v.ram=U. For the output of a fby equation
initialized with d, the initial state sets v.ram=(D d).
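The initial state of a single location can be written down directly from this description. The sketch below restates the relevant types so that it is self-contained; it assumes, for illustration only, that words and processor identifiers are integers, and `initial_loc` is a hypothetical helper name.

```ocaml
type word = int
type value = D of word | U
(* Processors are identified by integers in this sketch (an assumption). *)
type loc_value = { ram : value; cache : int -> value }
type mem_loc_state = W | R of int * loc_value

(* Initial state of one location: no reader, every cache view undefined,
   and RAM holding the fby constant d when the location is a fby output. *)
let initial_loc (fby_init : word option) : mem_loc_state =
  let ram = match fby_init with Some d -> D d | None -> U in
  R (0, { ram; cache = (fun _ -> U) })
```

Because every cache view starts as U, a thread has to invalidate its cache line before the first read of a fby output can return the constant stored in RAM.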

The semantics updates memory state using 4 functions whose prototypes are given below (in
OCaml syntax). The first two correspond to the dflush and dinval API calls. The remaining
2 implement the semantic actions taken upon entering and exiting an equation.

dinval: mem_state*processor*addr -> mem_state
dflush: mem_state*processor*addr -> mem_state
enter : mem_state*processor*(addr list)*(addr list) -> mem_state
exit  : mem_state*processor*((addr*word) list)*(addr list) -> mem_state


We call dinval(M, p, addr), resp. dflush(M, p, addr), upon execution of the corresponding
API primitive in memory state M, on processor p, and with argument addr. The functions can
only be applied when M(addr) = (R (n, v)). Their action changes v as follows: dinval(M, p, addr)
sets v.cache(p) to v.ram; dflush(M, p, addr) sets v.ram to v.cache(p) and v.cache(q) to U for
all q ≠ p.
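The effect of the two functions on one location can be sketched as follows. The code works on a single mem_loc_state rather than on the whole memory, identifies processors with integers, and returns `None` when the precondition M(addr) = (R (n, v)) does not hold; all three choices, and the names `dinval_loc` and `dflush_loc`, are simplifying assumptions of this illustration.

```ocaml
type word = int
type value = D of word | U
type loc_value = { ram : value; cache : int -> value }
type mem_loc_state = W | R of int * loc_value

(* dinval on one location: the view of processor p takes the RAM value. *)
let dinval_loc (p : int) (s : mem_loc_state) : mem_loc_state option =
  match s with
  | W -> None (* precondition violated: the location is being written *)
  | R (n, v) ->
    Some (R (n, { v with cache = (fun q -> if q = p then v.ram else v.cache q) }))

(* dflush on one location: RAM takes the view of processor p, and the view
   of every other processor becomes undefined. *)
let dflush_loc (p : int) (s : mem_loc_state) : mem_loc_state option =
  match s with
  | W -> None
  | R (n, v) ->
    Some (R (n, { ram = v.cache p;
                  cache = (fun q -> if q = p then v.cache p else U) }))
```

A value written by processor 0 becomes visible to processor 1 only after processor 0 flushes the location and processor 1 invalidates it, in that order.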

We denote with [X] the memory location of variable X. Function
enter(M, p, {X_1, ..., X_n}, {Y_1, ..., Y_m}) is called when starting the execution of an equation
that reads variables Y_1, ..., Y_m and writes variables X_1, ..., X_n. The call sets the memory state
of the locations [X_i] to W, and increments the read counter of the locations [Y_j] whenever [Y_j] is
not also the location of an output variable X_i.

Function exit(M, p, {(X_1, x_1), ..., (X_n, x_n)}, {Y_1, ..., Y_m}) is called when completing the
execution of an equation that reads variables Y_1, ..., Y_m and writes variables X_1, ..., X_n, when
the final values for the written variables are respectively x_i, i = 1,n. The call decrements the
read counter of the locations [Y_j] whenever [Y_j] is not also the location of an output variable.
For each of the locations [X_i], it sets its state to (R (0,v)) where v.cache(p) = (D x_i),
v.cache(q) = U for all q ≠ p, and v.ram = U.
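The bookkeeping performed on entering and exiting an equation can be sketched as below. This is an illustration under stated assumptions: memory is an association list from addresses (strings here) to location states, processors are integers, and the exit function is named `exit_eq` to avoid shadowing OCaml's standard `exit`. The access checks that lead to Err belong to the semantic rules and are not reproduced.

```ocaml
type word = int
type value = D of word | U
type loc_value = { ram : value; cache : int -> value }
type mem_loc_state = W | R of int * loc_value
type mem = (string * mem_loc_state) list (* assumed encoding of mem_state *)

let update (m : mem) (a : string) (f : mem_loc_state -> mem_loc_state) : mem =
  List.map (fun (b, s) -> if a = b then (b, f s) else (b, s)) m

(* enter: written locations go to W; the reader count of each read location
   is incremented unless it is also a written location. *)
let enter (m : mem) (_p : int) (writes : string list) (reads : string list) : mem =
  let m = List.fold_left (fun m a -> update m a (fun _ -> W)) m writes in
  List.fold_left
    (fun m a ->
      if List.mem a writes then m
      else update m a (function R (n, v) -> R (n + 1, v) | W -> W))
    m reads

(* exit: reader counts are decremented; each written location becomes
   readable again, its new value visible only in the cache of processor p. *)
let exit_eq (m : mem) (p : int) (writes : (string * word) list)
    (reads : string list) : mem =
  let written = List.map fst writes in
  let m =
    List.fold_left
      (fun m a ->
        if List.mem a written then m
        else update m a (function R (n, v) -> R (n - 1, v) | W -> W))
      m reads
  in
  List.fold_left
    (fun m (a, x) ->
      update m a (fun _ ->
          R (0, { ram = U; cache = (fun q -> if q = p then D x else U) })))
    m writes
```

After `exit_eq`, RAM is undefined for the written locations, which is why a flush by the writer has to precede any read from another core.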

For simplicity, like in the Kahnian case, platform semantics does not fully consider input 
variables. Each input variable is given a memory location different from all other variables, it is 
read by only one equation, and its value is assumed ready upon reading. 


Semaphore state. The binary semaphores provided by the execution platform are called locks. 
Their state is represented with Boolean values. We update lock_state objects using 2 functions 
corresponding to the wait and signal API calls: 


signal: lock_state*lock -> lock_state
wait  : lock_state*lock -> lock_state


Function signal(L,l) can be called only when L(l) = false. It sets L(l) to true. Conversely,
wait (L,l) can be called only when L(l) = true. It sets L(l) to false. 
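A minimal OCaml sketch of these two partial functions follows. It assumes locks are identified by integers and encodes "can be called only when" by returning `None` when the precondition fails; both choices are assumptions of the illustration, not part of the platform API.

```ocaml
type lock = int
type lock_state = lock -> bool

(* signal(L, l): defined only when L(l) = false; sets L(l) to true. *)
let signal (l_state : lock_state) (l : lock) : lock_state option =
  if l_state l then None
  else Some (fun k -> if k = l then true else l_state k)

(* wait(L, l): defined only when L(l) = true; sets L(l) to false. *)
let wait (l_state : lock_state) (l : lock) : lock_state option =
  if l_state l then Some (fun k -> if k = l then false else l_state k)
  else None
```

In the platform semantics the undefined case of signal corresponds to rule (sig-err), while an undefined wait simply means that the waiting thread cannot take a step yet.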


3.3.2 Semantic rules 

Platform semantics is also provided under SOS form. Transitions have the form $s_1 \to_p s_2$, where
$s_1$ and $s_2$ are objects of type mach_state and p is the processor performing the state transition
(recall that it is an interleaving semantics). Figures 5 and 6 provide the SOS rules. States
$s_i$ = Norm (ctrl, mem, locks) are represented here as ctrl, mem, locks.

3.3.3 Correctness and semantics preservation


An implementation program is correct in the platform semantics when it satisfies two properties: 
it does not block or enter Err state, regardless of the values read from input variables, and its 
execution under platform semantics preserves the semantics of its dataflow part under Kahnian 
semantics. Semantics preservation is defined as the preservation of the sequences of values taken 
as input (as guards or actual equation inputs) and produced as output by the execution of the 
various equations (which can be represented with histories, introduced in Section 2.2.1).

To be correct, an implementation program must satisfy the properties mentioned in
Section 2.2.4 and, in addition, must satisfy properties ensuring the correct use of the (sequential)
resources of the platform (not presented here for lack of space).


4 Experimental evaluation 

This paper introduces a language, not a mapping tool. From this perspective, we have already 
provided the elements showing that we can model implementations featuring optimized resource 
allocation. As a supplementary element of proof, this section shows that our language allows 
the representation of the optimized implementation of a real-life case study, produced by an 
automatic mapping tool. 

The case study is a complex piece of critical avionics software, including more than 6000
unique dataflow nodes and 36000 variables. Its current implementation is sequential and
time-triggered. Periodically, a sequence of 24 non-preemptive sequential tasks (built out of nodes)
is triggered at fixed time intervals. Our first objective was to parallelize each of the 24 tasks,
taken separately, and demonstrate the correctness of the implementations. Each task has been
transformed into an InteLus specification, and an automatic mapping tool performed allocation,
scheduling, memory allocation, and the synthesis of the synchronization and memory coherency
protocols. Implementation programs were synthesized, from which C/ldscript code allowing
compilation was generated. Using the results of this paper, the correctness of the implementation
can be formulated.

The following table shows various characteristics of the largest 6 tasks (in number of nodes) 
and averages on the 24 tasks. The first important figure concerns memory allocation (Mem), 
where reuse allows a 71% reduction over the number of specification variables. We exclude 
input variables from this table, because they are currently excluded from optimization (work in 
progress). 


          Specification       Implementation
Task      Nodes    Vars*      Mem*            Sema    Locks          Speed-up
T1        1511     6077       1951 (32.1%)    1040    96 (9.2%)      10.45x
T17       1115     5008       1603 (32%)      958     108 (11%)      9.85x
T9        1090     4813       1537 (31.9%)    907     120 (13%)      11.72x
T8        1289     5239       1351 (25.8%)    1291    115 (8.9%)     13.28x
T24       1313     5894       1294 (22%)      1370    118 (8.6%)     12.49x
T16       1258     4945       1247 (25.2%)    1296    118 (9.1%)     12.97x
Avg.                          28.8%                   10%            10.52x

* Input variables not included.
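The percentages in the table appear to relate Mem to Vars and Locks to Sema. This reading is an assumption, but it is consistent with every row and can be rechecked with a few lines of OCaml using the figures listed above (`pct` and `rows` are hypothetical names).

```ocaml
(* Percentage of part relative to whole. *)
let pct (part : int) (whole : int) : float =
  100.0 *. float_of_int part /. float_of_int whole

(* (task, Vars, Mem, Sema, Locks), copied from the table. *)
let rows =
  [ ("T1", 6077, 1951, 1040, 96);
    ("T17", 5008, 1603, 958, 108);
    ("T9", 4813, 1537, 907, 120);
    ("T8", 5239, 1351, 1291, 115);
    ("T24", 5894, 1294, 1370, 118);
    ("T16", 4945, 1247, 1296, 118) ]

let () =
  List.iter
    (fun (task, vars, mem, sema, locks) ->
      Printf.printf "%-4s Mem/Vars = %.1f%%  Locks/Sema = %.1f%%\n"
        task (pct mem vars) (pct locks sema))
    rows
```

The average Mem ratio of 28.8% over the 24 tasks corresponds to the reduction of about 71% quoted in the text.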


The number of semaphores (Sema) representing point-to-point thread synchronizations is
large, because very fine grain synchronization is used to allow efficient resource allocation.
However, the cost of these synchronizations is very low (a few CPU cycles each) and, most
importantly, very efficient reuse of the 127 hardware locks of the platform allows the
implementation of the semaphores. The speed-up figures are produced by an off-line scheduling
model not taking into account memory interferences, and thus are not safe. However, they provide
insight into application structure, and our implementation method should preserve functional
semantics regardless of timing imprecision.


5 Conclusion 

We have provided proof of our initial claim: Dataflow specifications can be given efficient
multi-threaded implementations that preserve an internal dataflow structure. The dataflow structure
of such implementations can be exposed using new language constructs. Doing so facilitates
formulating the correctness of implementations, and also proving it, through the use of results
such as Theorem 1 or through the use of analysis techniques specific to the dataflow domain.

Work will continue along two axes. We will enrich the implementation modeling language to 
incorporate new features. The first steps will be to include real-time capabilities and to diversify 
the target architectures by considering communication devices (DMAs, networks, external RAMs, 
etc.). The second objective is to develop a formally proven translation validation tool covering 
the transformation of functional specifications into implementation programs. This tool needs to 
be complemented later with the validation of the translation from implementation programs to 
code running on the platform, and with the validation of the translation from Lustre to InteLus. 

A Synchronous semantics of InteLus

To simplify the definition of the semantics, and without affecting generality, we shall assume that 
literals only appear as the first argument of a fby statement, that no identity equations “x=y” 
exist, nor lvalues, and that the input and output variables of a fby statement are different. 
This can be achieved through the use of constant and identity functions. We also make the 
hypothesis that each equation has (possibly empty) wait and done constructs. 

A.1 State representation

The fby statements are the state elements of an InteLus program, each of them holding either
the last value of the input variable (if any) or the initial constant of the fby statement. The
current state of a program can be represented by replacing the constants of the fby statements
by their current state. If we consider the fragment on the left:

(y) = add(x,x) (y) = add(x,x) 

(z) = 10 fby y (z) = 6 fby y 

then its initial state is the fragment itself. After one cycle where x has value 3, the new state is
the fragment on the right. Note that all program states are identical up to a change of the
constants in fby equations.
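The one-cycle evolution of this fragment can be replayed with a small OCaml sketch. The record `state` and the function `step` are hypothetical names; the sketch assumes integer values and that x is present in the cycle.

```ocaml
(* The only state of the fragment is the constant held by the fby. *)
type state = { fby_const : int }

(* One synchronous cycle with x present: returns (y, z) and the new state. *)
let step (s : state) (x : int) : (int * int) * state =
  let y = x + x in        (* (y) = add(x,x) *)
  let z = s.fby_const in  (* (z) = k fby y outputs the stored constant k *)
  ((y, z), { fby_const = y })
```

Starting from the constant 10 and feeding x = 3 gives y = 6 and z = 10, and the stored constant becomes 6, which is the fragment on the right.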

A. 2 Notations 

Recall, from Section 2.1, that $\mathcal{R}_V$ is the set of all mappings $r$ assigning to each $v \in V$ a value 
$r(v) \in Type(v) \cup \{nil\}$. We extend this notation. We denote with $\overline{\mathcal{R}}_V$ the set of partial valuations 
over $V$. Given $r \in \overline{\mathcal{R}}_V$ we define its support $supp(r)$ as the set of identifiers to which $r$ assigns 
a value (possibly nil). We also define $\overline{r}$ as the (partial) mapping that retains of $r$ only the 
valuations different from nil. Note that $\mathcal{R}_V \subseteq \overline{\mathcal{R}}_V$. 

We introduce a simple notation for (partial) mappings from variable identifiers to variable 
values. We denote with $\langle X_i \mapsto x_i \mid i = 1,n \rangle$ or $\langle X_1 \mapsto x_1, \ldots, X_n \mapsto x_n \rangle$ a (partial) mapping 
assigning value $x_i$ to identifier $X_i$, $i = 1,n$. The empty mapping is denoted $\langle\rangle$. The notation is 
ambiguous, as the domain of the mappings is not explicit. 

If $I \subseteq V$ and $r \in \mathcal{R}_V$, then $r|_I \in \mathcal{R}_I$ is the restriction of $r$ to the identifier set $I$. Note that if 
$r \in \overline{\mathcal{R}}_V$, then $r|_I \in \overline{\mathcal{R}}_I$. If $I \subseteq V$, then $\overline{\mathcal{R}}_I$ is naturally injected in $\overline{\mathcal{R}}_V$. By abuse of notation, we 
shall say that $\overline{\mathcal{R}}_I$ is included in $\overline{\mathcal{R}}_V$. 

If $r_1, r_2 \in \overline{\mathcal{R}}_V$ are such that all the bindings of $r_1$ are also bindings of $r_2$, then we write $r_1 \le r_2$. 
Relation $\le$ is a partial order on $\overline{\mathcal{R}}_V$. We denote with $\cup$ the partially defined least upper bound 
operator associated with $\le$ on $\overline{\mathcal{R}}_V$. The completely defined greatest lower bound operator is 
denoted $\cap$. If $r \in \overline{\mathcal{R}}_V$ and $I \subseteq V$, then we define $r \setminus I = \langle X \mapsto x \in r \mid X \notin I \rangle$. 
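
The operations on partial valuations introduced in this section can be sketched with Python dictionaries. The encoding is an assumption made for illustration only: a partial valuation is a `dict`, the marker `NIL` stands for the value nil (which is different from an identifier being unbound), and all function names are hypothetical.

```python
NIL = object()  # hypothetical marker for the value nil (not the same as "unbound")


def supp(r):
    """Support: identifiers to which r assigns a value, possibly nil."""
    return set(r)


def strip_nil(r):
    """Retain only the valuations different from nil."""
    return {v: x for v, x in r.items() if x is not NIL}


def restrict(r, ids):
    """Restriction of r to the identifier set ids."""
    return {v: x for v, x in r.items() if v in ids}


def remove(r, ids):
    """Bindings of r whose identifier is not in ids."""
    return {v: x for v, x in r.items() if v not in ids}


def leq(r1, r2):
    """True when every binding of r1 is also a binding of r2."""
    return all(v in r2 and r2[v] == x for v, x in r1.items())


def lub(r1, r2):
    """Least upper bound; None when r1 and r2 disagree on some identifier."""
    if any(v in r2 and r2[v] != x for v, x in r1.items()):
        return None
    return {**r1, **r2}


def glb(r1, r2):
    """Greatest lower bound: the bindings shared by r1 and r2."""
    return {v: x for v, x in r1.items() if v in r2 and r2[v] == x}


r = {"x": 3, "c": NIL}
print(sorted(supp(r)), strip_nil(r), leq({"x": 3}, r))
```

The least upper bound returns `None` when the two valuations bind the same identifier to different values, which mirrors the fact that the operator is only partially defined.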

A.3 Structural operational semantics rules 

We present the behavior of statements, code fragments, and full programs under the form of 
transitions defining the new state and variables assigned in one cycle for given input state and 
input variable assignments: 

$$eqs \xRightarrow[I]{O} eqs'$$

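Here, code fragments eqs and eqs' define the input and result program states, obtained by 
changes in the fby constants. If $\mathcal{I}$ is the set of variables taken as input by eqs, then $I \in \overline{\mathcal{R}}_{\mathcal{I}}$ 
with $supp(I) = \mathcal{I}$. Similarly, if $\mathcal{O}$ is the set of variables assigned by eqs, then $O \in \overline{\mathcal{R}}_{\mathcal{O}}$ with 
$supp(O) = \mathcal{O}$. Causal correctness requires that $\mathcal{I} \cap \mathcal{O} = \emptyset$. 

$$\frac{x \neq \mathit{nil}}{Z = \mathtt{merge}\ C\ X\ Y \ \xRightarrow[\langle C \mapsto \mathit{true},\, X \mapsto x,\, Y \mapsto \mathit{nil} \rangle]{\langle Z \mapsto x \rangle}\ Z = \mathtt{merge}\ C\ X\ Y} \quad \text{(merge+)}$$

$$\frac{y \neq \mathit{nil}}{Z = \mathtt{merge}\ C\ X\ Y \ \xRightarrow[\langle C \mapsto \mathit{false},\, X \mapsto \mathit{nil},\, Y \mapsto y \rangle]{\langle Z \mapsto y \rangle}\ Z = \mathtt{merge}\ C\ X\ Y} \quad \text{(merge-)}$$

$$Z = \mathtt{merge}\ C\ X\ Y \ \xRightarrow[\langle C \mapsto \mathit{nil},\, X \mapsto \mathit{nil},\, Y \mapsto \mathit{nil} \rangle]{\langle Z \mapsto \mathit{nil} \rangle}\ Z = \mathtt{merge}\ C\ X\ Y \quad \text{(merge-nil)}$$

$$\frac{x \neq \mathit{nil}}{Y = X\ \mathtt{when}\ C \ \xRightarrow[\langle C \mapsto \mathit{true},\, X \mapsto x \rangle]{\langle Y \mapsto x \rangle}\ Y = X\ \mathtt{when}\ C} \quad \text{(when+)}$$

$$\frac{x \neq \mathit{nil}}{Y = X\ \mathtt{when}\ C \ \xRightarrow[\langle C \mapsto \mathit{false},\, X \mapsto x \rangle]{\langle Y \mapsto \mathit{nil} \rangle}\ Y = X\ \mathtt{when}\ C} \quad \text{(when-)}$$

$$Y = X\ \mathtt{when}\ C \ \xRightarrow[\langle C \mapsto \mathit{nil},\, X \mapsto \mathit{nil} \rangle]{\langle Y \mapsto \mathit{nil} \rangle}\ Y = X\ \mathtt{when}\ C \quad \text{(when-nil)}$$

$$\frac{(y_1, \ldots, y_m) = F_{fid}(x_1, \ldots, x_n)}{(Y_1, \ldots, Y_m) = fid(X_1, \ldots, X_n) \ \xRightarrow[\langle X_i \mapsto x_i \mid i = 1,n \rangle]{\langle Y_i \mapsto y_i \mid i = 1,m \rangle}\ (Y_1, \ldots, Y_m) = fid(X_1, \ldots, X_n)} \quad \text{(fun)}$$

$$(Y_1, \ldots, Y_m) = fid(X_1, \ldots, X_n) \ \xRightarrow[\langle X_i \mapsto \mathit{nil} \mid i = 1,n \rangle]{\langle Y_i \mapsto \mathit{nil} \mid i = 1,m \rangle}\ (Y_1, \ldots, Y_m) = fid(X_1, \ldots, X_n) \quad \text{(fun-nil)}$$

$$(Y) = k\ \mathtt{fby}\ X \ \xRightarrow[\langle X \mapsto k' \rangle]{}\ (Y) = k'\ \mathtt{fby}\ X \quad \text{(fby)}$$

$$(Y) = k\ \mathtt{fby}\ X \ \xRightarrow[\langle X \mapsto \mathit{nil} \rangle]{}\ (Y) = k\ \mathtt{fby}\ X \quad \text{(fby-nil)}$$

$$\frac{p \xRightarrow[I]{O} p' \qquad \overline{I \cup O} \neq \langle\rangle \qquad \forall i,j : S_i, T_j \notin supp(I \cup O)}{\mathtt{wait}(S_1, \ldots, S_k)\ \mathtt{done}(T_1, \ldots, T_l)\ p \ \xRightarrow[I \cup \langle S_i \mapsto \mathit{top} \mid i = 1,k \rangle]{O \cup \langle T_i \mapsto \mathit{top} \mid i = 1,l \rangle}\ \mathtt{wait}(S_1, \ldots, S_k)\ \mathtt{done}(T_1, \ldots, T_l)\ p'} \quad \text{(wait-done)}$$

$$\frac{p \xRightarrow[I]{O} p \qquad \overline{I \cup O} = \langle\rangle \qquad \forall i,j : S_i, T_j \notin supp(I \cup O) \qquad \forall i,j : S_i \neq T_j}{\mathtt{wait}(S_1, \ldots, S_k)\ \mathtt{done}(T_1, \ldots, T_l)\ p \ \xRightarrow[I \cup \langle S_i \mapsto \mathit{nil} \mid i = 1,k \rangle]{O \cup \langle T_i \mapsto \mathit{nil} \mid i = 1,l \rangle}\ \mathtt{wait}(S_1, \ldots, S_k)\ \mathtt{done}(T_1, \ldots, T_l)\ p} \quad \text{(wait-done-nil)}$$

$$\frac{p_i \xRightarrow[I_i]{O_i} p'_i,\ i = 1,2 \qquad supp(O_1) \cap supp(O_2) = \emptyset \qquad supp(O_2) \cap supp(I_1) = \emptyset}{\mathrm{shuffle}(p_1, p_2) \ \xRightarrow[I_1 \cup (I_2 \setminus supp(O_1))]{O_1 \cup O_2}\ \mathrm{shuffle}(p'_1, p'_2)} \quad \text{(compose-acyclic)}$$

$$\frac{p_i \xRightarrow[I_i]{O_i} p'_i,\ i = 1,2 \qquad \mathrm{strip}(p_2) = \text{“}Y = k\ \mathtt{fby}\ X\text{”} \qquad supp(O_1),\ supp(O_2),\ \{Y\} \text{ mutually disjoint} \qquad supp(O_2) \cap supp(I_1) = \emptyset \qquad Y \mapsto k \in I_1}{\mathrm{shuffle}(p_1, p_2) \ \xRightarrow[I_1 \cup (I_2 \setminus supp(O_1))]{}\ \mathrm{shuffle}(p'_1, p'_2)} \quad \text{(compose-fby)}$$

$$\frac{\mathrm{body}(p) \xRightarrow[I]{O} \mathrm{body}(p') \qquad (I \cup O) \text{ maximal in the sense of } \le}{p \xRightarrow[I]{O} p'} \quad \text{(program-closure)}$$

Figure 7: Operational semantics of InteLus’ subset without guards. Each rule implicitly requires 
that the supports of input and output in the inferred transition are disjoint. The shuffle operator 
takes in two equation lists and can output any equation list obtained by shuffling the two initial 
ones. 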

The structural operational semantics rules used to derive transitions are provided in 
Fig. 7. Note that, according to rule (program-closure), program transitions are transitions 
of the program body with maximal activity, which amounts to an ASAP (as soon as possible) 
scheduling policy. To understand the importance of this rule, consider the following input-less 
program, which encodes a counter that increments x by 1 at each cycle. 

var x:word y:word 
let 

x = 1 fby y 
y = add(x,1) 
tel 

Semantic rules (fun-nil) and (compose-fby-nil) allow the construction of a program body 
transition where both variables are assigned value nil. However, rule (program-closure) forbids it, 
forcing the counter to advance at each cycle. 
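
The effect of maximal activity on this counter can be sketched in Python. The sketch is an illustration under stated assumptions, not the semantics itself: `body_transitions` hand-enumerates the two body transitions discussed above (the idle one and the active one), and `activity` approximates maximality in the sense of the order on valuations by counting non-nil bindings, which is sufficient to separate these two candidates. All names are hypothetical.

```python
NIL = None  # hypothetical marker for the value nil


def body_transitions(state):
    """Candidate transitions of the body, as (valuation, new_state) pairs.

    state is the constant currently held by  x = k fby y.
    """
    idle = ({"x": NIL, "y": NIL}, state)  # both variables absent, state unchanged
    x = state                             # the fby emits the stored constant
    y = x + 1                             # y = add(x,1)
    active = ({"x": x, "y": y}, y)        # the fby stores the new value of y
    return [idle, active]


def activity(valuation):
    """Number of non-nil bindings (stand-in for maximality)."""
    return sum(1 for value in valuation.values() if value is not NIL)


def program_step(state):
    """Select the body transition with maximal activity (ASAP policy)."""
    return max(body_transitions(state), key=lambda t: activity(t[0]))


state, trace = 1, []
for _ in range(3):
    valuation, state = program_step(state)
    trace.append(valuation)
print(trace)
```

Because the idle transition is never selected, the value of x advances by 1 at each cycle.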

A.4 Clocks 

In synchronous languages, clocks are used to represent activation conditions. Clocks represent 
sets of execution cycles, and can be seen as predicates over the state and inputs of the program. 

Clocks can be compared. We say that $ck_1$ and $ck_2$ are equal, denoted $ck_1 = ck_2$, if they 
represent the same execution cycles, i.e. if the associated predicates are true at the same execution 
cycles. We say that $ck_1$ is a sub-clock of $ck_2$, denoted $ck_1 \le ck_2$, if whenever the predicate of $ck_1$ 
is true in a cycle, the predicate of $ck_2$ is true, too. The least upper bound and greatest lower 
bound are fully defined and denoted $\vee$ and $\wedge$, respectively. The empty clock is denoted $\emptyset$, and 
the clock that is always present, known as the base clock, is denoted . in formulas. 

Typically, clocks are used to represent cycles where some variable is present, or where some 
function is executed. We denote with clk(v) the clock representing the cycles where variable v is 
present. Consider, for instance, the following code fragment: 

(a,b) = f(y) 

(x) = y when c 
(z) = y whenot c 

The following clock relations will hold: 

• $clk(a) = clk(b) = clk(y) = clk(c)$ 

• $clk(x) \le clk(c)$ and $clk(z) \le clk(c)$. 

• $clk(x) \wedge clk(z) = \emptyset$ and $clk(x) \vee clk(z) = clk(y)$ 

• $clk(z) = clk(y) \wedge \neg c$, where $c$ is the current value of variable c. 

Such relations are determined through static program analysis, known as clock calculus, and 
used to determine the consistency of the constraints imposed by program equations, thus proving 
program correctness. 
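
As an illustration, the clock relations of the fragment above can be checked on a finite trace in Python. This is a sketch under assumptions that are not part of the source: a clock is encoded as the set of cycle indices where it is present, y and c are taken to be present at every cycle of the trace, and the helper `clocks` is hypothetical. Sub-clock, greatest lower bound and least upper bound then become set inclusion, intersection and union.

```python
def clocks(c_trace):
    """Clocks of the variables of the fragment, as sets of cycle indices.

    c_trace lists the boolean value of c at each cycle (assumed always present).
    """
    base = set(range(len(c_trace)))
    clk = {v: set(base) for v in ("a", "b", "y", "c")}  # (a,b) = f(y) runs with y
    clk["x"] = {i for i in base if c_trace[i]}          # (x) = y when c
    clk["z"] = {i for i in base if not c_trace[i]}      # (z) = y whenot c
    return clk


clk = clocks([True, False, False, True])
print(sorted(clk["x"]), sorted(clk["z"]))
```

A clock calculus establishes such relations statically, for all executions; the sketch only checks them on one given trace.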
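$$\frac{eq \xRightarrow[I]{O} eq' \qquad \overline{I \cup O} \neq \langle\rangle \qquad \forall i,j : C, s_i, t_j \notin supp(I \cup O) \qquad \forall i,j : C \neq s_i \neq t_j \neq C}{r(eq) \ \xRightarrow[I \cup \langle C \mapsto \mathit{true} \rangle \cup \langle t_i \mapsto \mathit{top} \mid i = 1,l \rangle]{O \cup \langle s_i \mapsto \mathit{top} \mid i = 1,k \rangle}\ r(eq')} \quad \text{(on+)}$$

$$\frac{eq \xRightarrow[I]{O} eq \qquad \overline{I \cup O} = \langle\rangle \qquad \forall i,j : C, s_i, t_j \notin supp(I \cup O) \qquad \forall i,j : C \neq s_i \neq t_j \neq C}{r(eq) \ \xRightarrow[I \cup \langle C \mapsto \mathit{false} \rangle \cup \langle t_i \mapsto \mathit{top} \mid i = 1,l \rangle]{O \cup \langle s_i \mapsto \mathit{top} \mid i = 1,k \rangle}\ r(eq)} \quad \text{(on-)}$$

$$\frac{eq \xRightarrow[I]{O} eq \qquad \overline{I \cup O} = \langle\rangle \qquad \forall i,j : C, s_i, t_j \notin supp(I \cup O) \qquad \forall i,j : C \neq s_i \neq t_j \neq C}{r(eq) \ \xRightarrow[I \cup \langle C \mapsto \mathit{nil} \rangle \cup \langle t_i \mapsto \mathit{nil} \mid i = 1,l \rangle]{O \cup \langle s_i \mapsto \mathit{nil} \mid i = 1,k \rangle}\ r(eq)} \quad \text{(on-nil)}$$

$$\frac{eq \xRightarrow[I]{O} eq' \qquad \overline{I \cup O} \neq \langle\rangle}{s\ eq \ \xRightarrow[I \cup \langle s \mapsto \mathit{top} \rangle]{O}\ s\ eq'} \quad \text{(trig-comp)}$$

$$\frac{eq \xRightarrow[I]{O} eq \qquad \overline{I \cup O} = \langle\rangle}{s\ eq \ \xRightarrow[I \cup \langle s \mapsto \mathit{nil} \rangle]{O}\ s\ eq} \quad \text{(trig-comp-nil)}$$

$$\frac{eq \xRightarrow[I]{O} eq' \qquad \overline{I \cup O} \neq \langle\rangle}{\mathtt{top}\ eq \ \xRightarrow[I]{O}\ \mathtt{top}\ eq'} \quad \text{(trig-top)}$$

Figure 8: Operational semantics of guards. Rule (trig-top) makes reaction to absence (the nil rule) 
unneeded. To facilitate rule writing, we denote r(eq) = “wait($t_1, \ldots, t_l$) done($s_1, \ldots, s_k$) on C eq”. 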
A.5 Equation guards 

While clocks provide a logical representation of activation conditions, used for analysis, we also 
need an operational representation for them, used to generate the activation code of the 
implementation. The constructions used for this purpose are the guards introduced by the syntax of 

Fig.m 

The semantics of the full InteLus language (with guards) is provided as an extension of that 
of Fig. 7 with the new rules of Fig. 8. Existing rules remain unchanged. 

Under the extended semantics, if an equation has a guard, and if the trigger is present in a 
cycle, then the decision to execute the equation is taken by the evaluation of (present) guard 
variables (if any), and not by the occurrence (or not) of a present/absent variable configuration 
(as it was the case in the semantics without guards). In particular, when the trigger is top, no 
absence decision is needed in the execution of the equation. 
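
The decision process described in this paragraph can be sketched as follows. This is a simplified reading, not the semantics itself: the function `decide` and its three result strings are hypothetical, synchronization variables (wait and done) are left out, and guards are modeled as Python callables so that a guard variable is only read when the guards before it are true.

```python
TOP, NIL = "top", "nil"  # hypothetical encodings of trigger presence and absence


def decide(trigger, guards):
    """Outcome of one cycle for a guarded equation.

    trigger: TOP or NIL, the value carried by the trigger in this cycle.
    guards:  zero-argument callables returning booleans, read left to right.
    """
    if trigger == NIL:
        return "absent"       # cf. (trig-comp-nil): nothing happens in this cycle
    for guard in guards:
        if not guard():
            return "skip"     # cf. (on-): a guard is false, the equation is not run
    return "execute"          # cf. (on+): every guard evaluated to true


print(decide(TOP, []))
print(decide(TOP, [lambda: True, lambda: False]))
print(decide(NIL, [lambda: True]))
```

With a trigger that is always top, the `"absent"` branch is never taken, which matches the remark that no absence decision is needed in that case.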

When all equations of a program have a guard starting with trigger top, rule (program-closure) 
is no longer needed. This remains true when every maximal connected part of the 
dataflow contains at least one equation triggered by top. 

Note that, according to the extended semantics, different guards may represent the same 
logical activation condition (the same clock), but a different causality. Consider the following 
code fragment: 

top (b) = merge a bl b2 
top (c) = and(a,b) 
top on a on bl (x) = f(y) 
top on c (x) = f(y) 

Here, the last two equations can be equivalently used to define x, as the same value is computed 
at the same cycles (on the same clock). However, the two equations do not carry the same data 
dependencies due to the way the control (clocks) is computed. 
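
The difference between the two guards can be illustrated with a small Python check. The sketch rests on assumptions that go beyond the source: all variables are modeled as plain booleans (in the actual program b1 is only present when a is true, and b2 only when a is false), and the dependency table `deps` is written by hand from the two equations defining b and c. All function names are hypothetical.

```python
from itertools import product


def merge(a, b1, b2):
    """(b) = merge a b1 b2: take b1 when a is true, b2 otherwise."""
    return b1 if a else b2


def guard_on_a_on_b1(a, b1, b2):
    """Guard 'top on a on b1': b1 is read only when a is true."""
    return a and b1


def guard_on_c(a, b1, b2):
    """Guard 'top on c', with c = and(a, b) and b = merge a b1 b2."""
    return a and merge(a, b1, b2)


def transitive_deps(roots, deps):
    """Variables that must be available before the guard can be evaluated."""
    seen, todo = set(), list(roots)
    while todo:
        v = todo.pop()
        if v not in seen:
            seen.add(v)
            todo.extend(deps.get(v, ()))
    return seen


# Same activation condition for every combination of boolean values.
same_clock = all(
    guard_on_a_on_b1(*v) == guard_on_c(*v)
    for v in product([False, True], repeat=3)
)

deps = {"b": {"a", "b1", "b2"}, "c": {"a", "b"}}  # hand-written from the equations
deps_first = transitive_deps({"a", "b1"}, deps)
deps_second = transitive_deps({"c"}, deps)
print(same_clock, sorted(deps_first), sorted(deps_second))
```

Both guards are true in exactly the same situations, but the second one can only be evaluated after b and c have been computed.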